# **Exercise Session 7**

Cache coherency, Tomasulo, Scoreboard + Expl. Reg. Renaming

Advanced Computer Architectures

Politecnico di Milano May 21st, 2025

Alessandro Verosimile <alessandro.verosimile@polimi.it>





### Recall: Locality Principles

- Programs access a small proportion of their address space at any time
- Temporal locality
  - Items accessed recently are likely to be accessed again soon
  - e.g., instructions in a loop, induction variables
- Spatial locality
  - Items near those accessed recently are likely to be accessed soon
  - e.g., sequential instruction access, array data
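
Both kinds of locality can be made visible with a toy cache model. The sketch below is an illustration only: the function `hit_rate`, the cache geometry (4 lines of 8 words), and the three access patterns are assumptions chosen for this example, not part of the course material.

```python
def hit_rate(addresses, num_lines=4, block_words=8):
    """Hit rate of a tiny direct-mapped cache for a sequence of word addresses."""
    lines = [None] * num_lines          # block number currently held by each line
    hits = 0
    for addr in addresses:
        block = addr // block_words     # which memory block the word belongs to
        index = block % num_lines       # direct-mapped: one possible line per block
        if lines[index] == block:
            hits += 1
        else:
            lines[index] = block        # miss: fetch the whole block
    return hits / len(addresses)

sequential = list(range(64))            # array walk: spatial locality
strided = [i * 8 for i in range(64)]    # one word per block: no locality
loop = [0, 1, 2, 3] * 16                # small loop: temporal locality

print(hit_rate(sequential))  # 0.875
print(hit_rate(strided))     # 0.0
print(hit_rate(loop))        # 0.984375
```

The sequential walk misses once per 8-word block, the strided walk never reuses a fetched block, and the loop misses only on its very first access.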





### Recall: Memory hierarchy







#### Recall: MIMD Machines

#### Symmetric Multiprocessor

- Multiple processors in a box with shared-memory communication
- Current multicore chips are like this
- Every processor runs a copy of the OS

#### Non-uniform shared memory with separate I/O through host

- Multiple processors
  - Each with local memory
  - General scalable network
- Extremely light "OS" on each node provides simple services
  - Scheduling/synchronization
- Network-accessible host for I/O

#### Cluster

- Many independent machines connected with a general network
- Communication through messages







### Recall: Memory Address Space Model

Single logically shared address space

Multiple and private address spaces







### Recall: Physical Memory Organization

#### Centralized shared-memory architectures

- At most a few dozen processor chips (< 100 cores)
- Large caches, single memory with multiple banks
- Often called symmetric multiprocessors (SMP); this style of architecture is called Uniform Memory Access (UMA)



#### Distributed memory architectures

- To support large processor counts
- Requires high-bandwidth interconnect
- Cons: data communication among processors becomes more complex and slower
- Non Uniform Memory Access (NUMA)







#### Recall: Address Space vs. Physical Memory Org.

- Single logically shared address space (shared-memory architectures)
- Multiple and private address spaces (message-passing architectures)














#### Recall: Example Cache Coherency Problem



#### Things to note:

- Processors see different values for u after event 3
- With write-back caches, the value written back to memory depends on which cache happens to flush or write back its value, and when
  - Processes accessing main memory may see a very stale value
- Unacceptable to programs, and frequent!
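
The problem can be reproduced in a few lines. This is a minimal sketch under stated assumptions: three processors with private write-back caches and no coherence protocol at all. The classes `Memory` and `WriteBackCache`, the event order, and the values 5 and 7 are hypothetical choices for illustration, not taken from the slide's figure.

```python
class Memory:
    def __init__(self):
        self.data = {"u": 5}                  # assumed initial value of u

class WriteBackCache:
    """Private write-back cache with NO coherence protocol (illustration only)."""
    def __init__(self, memory):
        self.memory = memory
        self.lines = {}                       # address -> cached value
        self.dirty = set()                    # addresses modified locally

    def read(self, addr):
        if addr not in self.lines:            # miss: fetch from main memory
            self.lines[addr] = self.memory.data[addr]
        return self.lines[addr]

    def write(self, addr, value):
        self.lines[addr] = value              # only the local copy changes
        self.dirty.add(addr)

    def flush(self, addr):
        if addr in self.dirty:                # write the dirty value back
            self.memory.data[addr] = self.lines[addr]
            self.dirty.discard(addr)

mem = Memory()
p1, p2, p3 = (WriteBackCache(mem) for _ in range(3))

p1.read("u")          # event 1: P1 caches u = 5
p3.read("u")          # event 2: P3 caches u = 5
p3.write("u", 7)      # event 3: P3 writes u = 7 in its own cache only
seen = {"P1": p1.read("u"), "P2": p2.read("u"), "P3": p3.read("u")}
print(seen)           # {'P1': 5, 'P2': 5, 'P3': 7}
```

Even after `p3.flush("u")` updates memory, P1 keeps returning its stale cached copy, which is why some mechanism has to invalidate or update the other caches.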





#### **Recall: Potential Solutions**

- HW-based solutions to maintain coherency: Cache-Coherence Protocols
- The key issue in implementing a cache-coherence protocol in multiprocessors is tracking the sharing status of each data block.
- Two classes of protocols:
  - Snooping Protocols
  - Directory-Based Protocols





### Recall: Snooping Protocols







### Recall: Basic Snooping Protocols

Snooping protocols are of two types, depending on what happens on a write operation:

- Write-Invalidate Protocol
- Write-Update (or Write-Broadcast) Protocol
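
The difference between the two types can be sketched as follows. This is a simplified model, assuming each cache is a plain dictionary and that every write is visible on the bus; the function `bus_write` and its `policy` argument are hypothetical names introduced for this example.

```python
def bus_write(caches, writer, addr, value, policy):
    """Apply a write by cache `writer` and let the other caches snoop it."""
    caches[writer][addr] = value
    for i, cache in enumerate(caches):
        if i == writer or addr not in cache:
            continue                      # nothing to do for caches without a copy
        if policy == "invalidate":
            del cache[addr]               # write-invalidate: other copies are dropped
        elif policy == "update":
            cache[addr] = value           # write-update: other copies get the new value
        else:
            raise ValueError(f"unknown policy: {policy}")

inv = [{"x": 1}, {"x": 1}]
bus_write(inv, writer=0, addr="x", value=2, policy="invalidate")
upd = [{"x": 1}, {"x": 1}]
bus_write(upd, writer=0, addr="x", value=2, policy="update")

print(inv)  # [{'x': 2}, {}]
print(upd)  # [{'x': 2}, {'x': 2}]
```

With invalidation the next read by the other processor misses and fetches the new value; with update the new data travels on the bus at every write.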





#### Recall: MESI Protocol

MESI Protocol: Write-Invalidate

Each cache block can be in one of four states:

- Modified: the block is dirty and cannot be shared; the cache has the only copy, and it is writable.
- Exclusive: the block is clean and the cache has the only copy.
- Shared: the block is clean and other copies of the block may be in other caches.
- Invalid: the block contains no valid data.

The Exclusive state is added to distinguish a block that is exclusive (clean, writable without a bus transaction) from one that is owned (already written, i.e. Modified).





### Recall: MESI State Transition Diagram





Cache line in the "acting" processor

Transaction due to events snooped on the common BUS

#### **BUS Transactions**

- RH = Read Hit
- RMS = Read Miss, Shared
- RME = Read Miss, Exclusive
- WH = Write Hit
- WM = Write Miss
- SHR = Snoop Hit on a Read
- SHW = Snoop Hit on a Write or Read-with-Intent-to-Modify

Bus operations marked by symbols in the diagram: Snoop Push, Invalidate Transaction, Read-with-Intent-to-Modify, Cache Block Fill
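
Using the mnemonics above, the state diagram can be written down as a lookup table. The sketch below encodes the usual textbook MESI transitions; it is an assumption that the slide's diagram matches it in every detail, and the names `State`, `TRANSITIONS`, and `step` are introduced here for illustration only.

```python
from enum import Enum

class State(Enum):
    MODIFIED = "M"
    EXCLUSIVE = "E"
    SHARED = "S"
    INVALID = "I"

M, E, S, I = State.MODIFIED, State.EXCLUSIVE, State.SHARED, State.INVALID

# (current state, event) -> (next state, main bus operation or None)
TRANSITIONS = {
    # Events caused by the local ("acting") processor
    (I, "RMS"): (S, "cache block fill"),
    (I, "RME"): (E, "cache block fill"),
    (I, "WM"):  (M, "read-with-intent-to-modify"),
    (S, "RH"):  (S, None),
    (S, "WH"):  (M, "invalidate transaction"),
    (E, "RH"):  (E, None),
    (E, "WH"):  (M, None),                # silent upgrade, no bus traffic
    (M, "RH"):  (M, None),
    (M, "WH"):  (M, None),
    # Events snooped on the common bus
    (S, "SHR"): (S, None),
    (S, "SHW"): (I, None),
    (E, "SHR"): (S, None),
    (E, "SHW"): (I, None),
    (M, "SHR"): (S, "snoop push"),        # dirty data is pushed out first
    (M, "SHW"): (I, "snoop push"),
}

def step(state, event):
    """Return (next_state, bus_operation) for a cache line in `state`."""
    try:
        return TRANSITIONS[(state, event)]
    except KeyError:
        raise ValueError(f"{event} cannot occur in state {state.name}") from None

# A write miss while another cache holds the line Modified:
writer_next, writer_op = step(I, "WM")
owner_next, owner_op = step(M, "SHW")
print(writer_next.name, writer_op)   # MODIFIED read-with-intent-to-modify
print(owner_next.name, owner_op)     # INVALID snoop push
```

Keeping the transitions in a table rather than in nested conditionals makes it easy to compare the code line by line with a state diagram.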





Consider the following access pattern on a two-processor system in which each processor has a direct-mapped, write-back cache with a single cache block, and memory holds two cache blocks.

Assume the MESI protocol is used, with writeback caches, write-allocate, and invalidation of other caches on write (instead of updating the value in the other caches).
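
A hand-filled table can be checked with a small simulator. The following is a sketch under the assumptions stated in the exercise (one line per cache, two memory blocks, write-back, write-allocate, invalidate on write); the names `Cache`, `access`, `run`, and `OPS` are hypothetical, and the bus is modeled only through its effect on states and on whether memory is up to date.

```python
# (processor, operation, block) for cycles 0..14 of the exercise
OPS = [
    (0, "read", 1), (1, "read", 0), (0, "write", 1), (0, "write", 0),
    (1, "read", 0), (1, "write", 0), (0, "read", 1), (1, "read", 1),
    (0, "write", 1), (1, "write", 1), (0, "read", 0), (1, "write", 1),
    (1, "read", 1), (0, "read", 1), (1, "write", 1),
]

class Cache:
    """A cache with a single line, so every block maps to the same line."""
    def __init__(self):
        self.state, self.block = "Invalid", None

    def holds(self, block):
        return self.state != "Invalid" and self.block == block

    def label(self):
        return "Invalid" if self.state == "Invalid" else f"{self.state} ({self.block})"

def access(caches, mem_ok, who, op, block):
    """Apply one access by processor `who`, updating caches and mem_ok in place."""
    me, other = caches[who], caches[1 - who]
    if not me.holds(block):
        # Miss: the resident block is evicted; a Modified line is written back.
        if me.state == "Modified":
            mem_ok[me.block] = True
        me.state, me.block = "Invalid", None
    if op == "read":
        if me.state == "Invalid":              # read miss
            if other.holds(block):
                if other.state == "Modified":
                    mem_ok[block] = True       # snoop push updates memory
                other.state = me.state = "Shared"
            else:
                me.state = "Exclusive"
            me.block = block
    else:                                      # write hit, or write miss with allocate
        if other.holds(block):
            if other.state == "Modified":
                mem_ok[block] = True           # snoop push before invalidation
            other.state, other.block = "Invalid", None
        me.state, me.block = "Modified", block
        mem_ok[block] = False                  # newest value lives only in this cache

def run(ops):
    """Return one (P0 state, P1 state, mem 0 ok, mem 1 ok) row per operation."""
    caches = [Cache(), Cache()]
    mem_ok = {0: True, 1: True}
    rows = []
    for who, op, block in ops:
        access(caches, mem_ok, who, op, block)
        rows.append((caches[0].label(), caches[1].label(),
                     "Yes" if mem_ok[0] else "No", "Yes" if mem_ok[1] else "No"))
    return rows

for cycle, row in enumerate(run(OPS)):
    who, op, block = OPS[cycle]
    print(cycle, f"P{who}: {op} block {block}", *row, sep=" | ")
```

Each printed row has the same columns as the exercise table, so it can be compared directly with a solution worked out by hand.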





#### Recall: Write Policy

- Write-through: all writes update cache and underlying memory/cache
  - Can always discard cached data: the most up-to-date data is in memory
  - Cache control bit: only a *valid* bit
- Write-back: all writes simply update cache
  - Can't just discard cached data: it may have to be written back to memory
  - Cache control bits: both valid and dirty bits
    - How to identify the most recent data value of a cache block in case of a cache miss?
      - It can be in a cache rather than in memory
    - Can use the same snooping scheme both for cache misses and writes
      - Each processor snoops every address placed on the bus
      - If a processor finds that it has a dirty copy of the requested cache block, it provides the cache block in response to the read request and the memory access is aborted



#### Recall: Write Policy

- What happens on a write miss?
- Write allocate: allocate a new cache line in the cache
  - Usually means that you have to do a "read miss" to fill in the rest of the cache line!
  - Alternative: per-word valid bits
- Write non-allocate (or "write-around"):
  - Simply send write data through to underlying memory/cache; don't allocate a new cache line!
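
The two write-miss policies can be contrasted in a short sketch. It assumes a 4-word cache line, a write-back cache for the allocate case, and memory modeled as a flat list of words; `LINE_WORDS` and `write_miss` are hypothetical names used only in this example.

```python
LINE_WORDS = 4  # assumed line size, in words

def write_miss(cache, memory, addr, value, allocate):
    """Handle a write miss to word `addr` under one of the two policies."""
    line_no, offset = divmod(addr, LINE_WORDS)
    if allocate:
        # Write allocate: a "read miss" fills in the rest of the line,
        # then the write is performed in the cache (memory updated later).
        base = line_no * LINE_WORDS
        line = memory[base:base + LINE_WORDS]   # slicing copies the words
        line[offset] = value
        cache[line_no] = line
    else:
        # Write-around: send the data to memory, allocate nothing.
        memory[addr] = value

memory = list(range(100, 108))   # two lines of four words
cache = {}

write_miss(cache, memory, 5, -1, allocate=True)
print(cache, memory[5])          # {1: [104, -1, 106, 107]} 105

write_miss(cache, memory, 2, -2, allocate=False)
print(0 in cache, memory[2])     # False -2
```

After the first call the line is cached and memory is stale; after the second call memory is current and the cache holds nothing for that line.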





| Cycle | After Operation   | P0 cache<br>block state | P1 cache<br>block state | Memory at block<br>0<br>up to date? | Memory at block<br>1<br>up to date? |
|-------|-------------------|-------------------------|-------------------------|-------------------------------------|-------------------------------------|
| 0     | P0: read block 1  | Exclusive (1)           | Invalid                 | Yes                                 | Yes                                 |
| 1     | P1: read block 0  |                         |                         |                                     |                                     |
| 2     | P0: write block 1 |                         |                         |                                     |                                     |
| 3     | P0: write block 0 |                         |                         |                                     |                                     |
| 4     | P1: read block 0  |                         |                         |                                     |                                     |
| 5     | P1: write block 0 |                         |                         |                                     |                                     |
| 6     | P0: read block 1  |                         |                         |                                     |                                     |
| 7     | P1: read block 1  |                         |                         |                                     |                                     |
| 8     | P0: write block 1 |                         |                         |                                     |                                     |
| 9     | P1: write block 1 |                         |                         |                                     |                                     |
| 10    | P0: read block 0  |                         |                         |                                     |                                     |
| 11    | P1: write block 1 |                         |                         |                                     |                                     |
| 12    | P1: read block 1  |                         |                         |                                     |                                     |
| 13    | P0: read block 1  |                         |                         |                                     |                                     |
| 14    | P1: write block 1 |                         |                         |                                     |                                     |

| Cycle | After Operation   | P0 cache<br>block state | P1 cache<br>block state | Memory at block<br>0<br>up to date? | Memory at block<br>1<br>up to date? |
|-------|-------------------|-------------------------|-------------------------|-------------------------------------|-------------------------------------|
| 0     | P0: read block 1  | Exclusive (1)           | Invalid                 | Yes                                 | Yes                                 |
| 1     | P1: read block 0  | Exclusive (1)           | Exclusive (0)           | Yes                                 | Yes                                 |
| 2     | P0: write block 1 |                         |                         |                                     |                                     |
| 3     | P0: write block 0 |                         |                         |                                     |                                     |
| 4     | P1: read block 0  |                         |                         |                                     |                                     |
| 5     | P1: write block 0 |                         |                         |                                     |                                     |
| 6     | P0: read block 1  |                         |                         |                                     |                                     |
| 7     | P1: read block 1  |                         |                         |                                     |                                     |
| 8     | P0: write block 1 |                         |                         |                                     |                                     |
| 9     | P1: write block 1 |                         |                         |                                     |                                     |
| 10    | P0: read block 0  |                         |                         |                                     |                                     |
| 11    | P1: write block 1 |                         |                         |                                     |                                     |
| 12    | P1: read block 1  |                         |                         |                                     |                                     |
| 13    | P0: read block 1  |                         |                         |                                     |                                     |
| 14    | P1: write block 1 |                         |                         |                                     |                                     |

| Cycle | After Operation   | P0 cache<br>block state | P1 cache<br>block state | Memory at block<br>0<br>up to date? | Memory at block<br>1<br>up to date? |
|-------|-------------------|-------------------------|-------------------------|-------------------------------------|-------------------------------------|
| 0     | P0: read block 1  | Exclusive (1)           | Invalid                 | Yes                                 | Yes                                 |
| 1     | P1: read block 0  | Exclusive (1)           | Exclusive (0)           | Yes                                 | Yes                                 |
| 2     | P0: write block 1 | Modified (1)            | Exclusive (0)           | Yes                                 | No                                  |
| 3     | P0: write block 0 |                         |                         |                                     |                                     |
| 4     | P1: read block 0  |                         |                         |                                     |                                     |
| 5     | P1: write block 0 |                         |                         |                                     |                                     |
| 6     | P0: read block 1  |                         |                         |                                     |                                     |
| 7     | P1: read block 1  |                         |                         |                                     |                                     |
| 8     | P0: write block 1 |                         |                         |                                     |                                     |
| 9     | P1: write block 1 |                         |                         |                                     |                                     |
| 10    | P0: read block 0  |                         |                         |                                     |                                     |
| 11    | P1: write block 1 |                         |                         |                                     |                                     |
| 12    | P1: read block 1  |                         |                         |                                     |                                     |
| 13    | P0: read block 1  |                         |                         |                                     |                                     |
| 14    | P1: write block 1 |                         |                         |                                     |                                     |


| Cycle | After Operation   | P0 cache<br>block state | P1 cache<br>block state | Memory at block<br>0<br>up to date? | Memory at block<br>1<br>up to date? |
|-------|-------------------|-------------------------|-------------------------|-------------------------------------|-------------------------------------|
| 0     | P0: read block 1  | Exclusive (1)           | Invalid                 | Yes                                 | Yes                                 |
| 1     | P1: read block 0  | Exclusive (1)           | Exclusive (0)           | Yes                                 | Yes                                 |
| 2     | P0: write block 1 | Modified (1)            | Exclusive (0)           | Yes                                 | No                                  |
| 3     | P0: write block 0 | Modified (0)            | Invalid                 | No                                  | Yes                                 |
| 4     | P1: read block 0  |                         |                         |                                     |                                     |
| 5     | P1: write block 0 |                         |                         |                                     |                                     |
| 6     | P0: read block 1  |                         |                         |                                     |                                     |
| 7     | P1: read block 1  |                         |                         |                                     |                                     |
| 8     | P0: write block 1 |                         |                         |                                     |                                     |
| 9     | P1: write block 1 |                         |                         |                                     |                                     |
| 10    | P0: read block 0  |                         |                         |                                     |                                     |
| 11    | P1: write block 1 |                         |                         |                                     |                                     |
| 12    | P1: read block 1  |                         |                         |                                     |                                     |
| 13    | P0: read block 1  |                         |                         |                                     |                                     |
| 14    | P1: write block 1 |                         |                         |                                     |                                     |

| Cycle | After Operation   | P0 cache<br>block state | P1 cache<br>block state | Memory at block<br>0<br>up to date? | Memory at block<br>1<br>up to date? |
|-------|-------------------|-------------------------|-------------------------|-------------------------------------|-------------------------------------|
| 0     | P0: read block 1  | Exclusive (1)           | Invalid                 | Yes                                 | Yes                                 |
| 1     | P1: read block 0  | Exclusive (1)           | Exclusive (0)           | Yes                                 | Yes                                 |
| 2     | P0: write block 1 | Modified (1)            | Exclusive (0)           | Yes                                 | No                                  |
| 3     | P0: write block 0 | Modified (0)            | Invalid                 | No                                  | Yes                                 |
| 4     | P1: read block 0  | Shared (0)              | Shared (0)              | Yes                                 | Yes                                 |
| 5     | P1: write block 0 |                         |                         |                                     |                                     |
| 6     | P0: read block 1  |                         |                         |                                     |                                     |
| 7     | P1: read block 1  |                         |                         |                                     |                                     |
| 8     | P0: write block 1 |                         |                         |                                     |                                     |
| 9     | P1: write block 1 |                         |                         |                                     |                                     |
| 10    | P0: read block 0  |                         |                         |                                     |                                     |
| 11    | P1: write block 1 |                         |                         |                                     |                                     |
| 12    | P1: read block 1  |                         |                         |                                     |                                     |
| 13    | P0: read block 1  |                         |                         |                                     |                                     |
| 14    | P1: write block 1 |                         |                         |                                     |                                     |

| Cycle | After Operation   | P0 cache<br>block state | P1 cache<br>block state | Memory at block<br>0<br>up to date? | Memory at block<br>1<br>up to date? |
|-------|-------------------|-------------------------|-------------------------|-------------------------------------|-------------------------------------|
| 0     | P0: read block 1  | Exclusive (1)    | Invalid       | Yes             | Yes              |
| 1     | P1: read block 0  | Exclusive (1)    | Exclusive (0) | Yes             | Yes              |
| 2     | P0: write block 1 | Modified (1)     | Exclusive (0) | Yes             | No               |
| 3     | P0: write block 0 | Modified (0)     | Invalid       | No              | Yes              |
| 4     | P1: read block 0  | Shared (0)       | Shared (0)    | Yes             | Yes              |
| 5     | P1: write block 0 | Invalid          | Modified (0)  | No              | Yes              |
| 6     | P0: read block 1  |                  |               |                 |                  |
| 7     | P1: read block 1  |                  |               |                 |                  |
| 8     | P0: write block 1 |                  |               |                 |                  |
| 9     | P1: write block 1 |                  |               |                 |                  |
| 10    | P0: read block 0  |                  |               |                 |                  |
| 11    | P1: write block 1 |                  |               |                 |                  |
| 12    | P1: read block 1  |                  |               |                 |                  |
| 13    | P0: read block 1  |                  |               |                 |                  |
| 14    | P1: write block 1 |                  |               |                 |                  |

| Cycle | After Operation   | P0 cache<br>block state | P1 cache<br>block state | Memory at block<br>0<br>up to date? | Memory at block<br>1<br>up to date? |
|-------|-------------------|-------------------------|-------------------------|-------------------------------------|-------------------------------------|
| 0     | P0: read block 1  | Exclusive (1)           | Invalid                 | Yes                                 | Yes                                 |
| 1     | P1: read block 0  | Exclusive (1)           | Exclusive (0)           | Yes                                 | Yes                                 |
| 2     | P0: write block 1 | Modified (1)            | Exclusive (0)           | Yes                                 | No                                  |
| 3     | P0: write block 0 | Modified (0)            | Invalid                 | No                                  | Yes                                 |
| 4     | P1: read block 0  | Shared (0)              | Shared (0)              | Yes                                 | Yes                                 |
| 5     | P1: write block 0 | Invalid                 | Modified (0)            | No                                  | Yes                                 |
| 6     | P0: read block 1  | Exclusive (1)           | Modified (0)            | No                                  | Yes                                 |
| 7     | P1: read block 1  |                         |                         |                                     |                                     |
| 8     | P0: write block 1 |                         |                         |                                     |                                     |
| 9     | P1: write block 1 |                         |                         |                                     |                                     |
| 10    | P0: read block 0  |                         |                         |                                     |                                     |
| 11    | P1: write block 1 |                         |                         |                                     |                                     |
| 12    | P1: read block 1  |                         |                         |                                     |                                     |
| 13    | P0: read block 1  |                         |                         |                                     |                                     |
| 14    | P1: write block 1 |                         |                         |                                     |                                     |

| Cycle | After Operation   | P0 cache<br>block state | P1 cache<br>block state | Memory at block<br>0<br>up to date? | Memory at block<br>1<br>up to date? |
|-------|-------------------|-------------------------|-------------------------|-------------------------------------|-------------------------------------|
| 0     | P0: read block 1  | Exclusive (1)           | Invalid                 | Yes                                 | Yes                                 |
| 1     | P1: read block 0  | Exclusive (1)           | Exclusive (0)           | Yes                                 | Yes                                 |
| 2     | P0: write block 1 | Modified (1)            | Exclusive (0)           | Yes                                 | No                                  |
| 3     | P0: write block 0 | Modified (0)            | Invalid                 | No                                  | Yes                                 |
| 4     | P1: read block 0  | Shared (0)              | Shared (0)              | Yes                                 | Yes                                 |
| 5     | P1: write block 0 | Invalid                 | Modified (0)            | No                                  | Yes                                 |
| 6     | P0: read block 1  | Exclusive (1)           | Modified (0)            | No                                  | Yes                                 |
| 7     | P1: read block 1  | Shared (1)              | Shared (1)              | Yes                                 | Yes                                 |
| 8     | P0: write block 1 |                         |                         |                                     |                                     |
| 9     | P1: write block 1 |                         |                         |                                     |                                     |
| 10    | P0: read block 0  |                         |                         |                                     |                                     |
| 11    | P1: write block 1 |                         |                         |                                     |                                     |
| 12    | P1: read block 1  |                         |                         |                                     |                                     |
| 13    | P0: read block 1  |                         |                         |                                     |                                     |
| 14    | P1: write block 1 |                         |                         |                                     |                                     |

| Cycle | After Operation   | P0 cache<br>block state | P1 cache<br>block state | Memory at block<br>0<br>up to date? | Memory at block<br>1<br>up to date? |
|-------|-------------------|-------------------------|-------------------------|-------------------------------------|-------------------------------------|
| 0     | P0: read block 1  | Exclusive (1)           | Invalid                 | Yes                                 | Yes                                 |
| 1     | P1: read block 0  | Exclusive (1)           | Exclusive (0)           | Yes                                 | Yes                                 |
| 2     | P0: write block 1 | Modified (1)            | Exclusive (0)           | Yes                                 | No                                  |
| 3     | P0: write block 0 | Modified (0)            | Invalid                 | No                                  | Yes                                 |
| 4     | P1: read block 0  | Shared (0)              | Shared (0)              | Yes                                 | Yes                                 |
| 5     | P1: write block 0 | Invalid                 | Modified (0)            | No                                  | Yes                                 |
| 6     | P0: read block 1  | Exclusive (1)           | Modified (0)            | No                                  | Yes                                 |
| 7     | P1: read block 1  | Shared (1)              | Shared (1)              | Yes                                 | Yes                                 |
| 8     | P0: write block 1 | Modified (1)            | Invalid                 | Yes                                 | No                                  |
| 9     | P1: write block 1 |                         |                         |                                     |                                     |
| 10    | P0: read block 0  |                         |                         |                                     |                                     |
| 11    | P1: write block 1 |                         |                         |                                     |                                     |
| 12    | P1: read block 1  |                         |                         |                                     |                                     |
| 13    | P0: read block 1  |                         |                         |                                     |                                     |
| 14    | P1: write block 1 |                         |                         |                                     |                                     |

#### @T9: P1: write block 1



**BUS Transactions**

- RH = Read Hit
- RMS = Read Miss, Shared
- RME = Read Miss, Exclusive
- WH = Write Hit
- WM = Write Miss
- SHR = Snoop Hit on a Read
- SHW = Snoop Hit on a Write or Read-with-Intent-to-Modify






| Cycle | After Operation   | P0 cache<br>block state | P1 cache<br>block state | Memory at block<br>0<br>up to date? | Memory at block<br>1<br>up to date? |
|-------|-------------------|-------------------------|-------------------------|-------------------------------------|-------------------------------------|
| 0     | P0: read block 1  | Exclusive (1) | Invalid       | Yes                             | Yes              |
| 1     | P1: read block 0  | Exclusive (1) | Exclusive (0) | Yes                             | Yes              |
| 2     | P0: write block 1 | Modified (1)  | Exclusive (0) | Yes                             | No               |
| 3     | P0: write block 0 | Modified (0)  | Invalid       | No                              | Yes              |
| 4     | P1: read block 0  | Shared (0)    | Shared (0)    | Yes                             | Yes              |
| 5     | P1: write block 0 | Invalid       | Modified (0)  | No                              | Yes              |
| 6     | P0: read block 1  | Exclusive (1) | Modified (0)  | No                              | Yes              |
| 7     | P1: read block 1  | Shared (1)    | Shared (1)    | Yes                             | Yes              |
| 8     | P0: write block 1 | Modified (1)  | Invalid       | Yes                             | No               |
| 9     | P1: write block 1 | Invalid       | Modified (1)  | Yes                             | No               |
| 10    | P0: read block 0  |               |               |                                 |                  |
| 11    | P1: write block 1 |               |               |                                 |                  |
| 12    | P1: read block 1  |               |               |                                 |                  |
| 13    | P0: read block 1  |               |               |                                 |                  |
| 14    | P1: write block 1 |               |               |                                 |                  |

| Cycle | After Operation   | P0 cache<br>block state | P1 cache<br>block state | Memory at block<br>0<br>up to date? | Memory at block<br>1<br>up to date? |
|-------|-------------------|-------------------------|-------------------------|-------------------------------------|-------------------------------------|
| 0     | P0: read block 1  | Exclusive (1)           | Invalid                 | Yes                                 | Yes                                 |
| 1     | P1: read block 0  | Exclusive (1)           | Exclusive (0)           | Yes                                 | Yes                                 |
| 2     | P0: write block 1 | Modified (1)            | Exclusive (0)           | Yes                                 | No                                  |
| 3     | P0: write block 0 | Modified (0)            | Invalid                 | No                                  | Yes                                 |
| 4     | P1: read block 0  | Shared (0)              | Shared (0)              | Yes                                 | Yes                                 |
| 5     | P1: write block 0 | Invalid                 | Modified (0)            | No                                  | Yes                                 |
| 6     | P0: read block 1  | Exclusive (1)           | Modified (0)            | No                                  | Yes                                 |
| 7     | P1: read block 1  | Shared (1)              | Shared (1)              | Yes                                 | Yes                                 |
| 8     | P0: write block 1 | Modified (1)            | Invalid                 | Yes                                 | No                                  |
| 9     | P1: write block 1 | Invalid                 | Modified (1)            | Yes                                 | No                                  |
| 10    | P0: read block 0  | Exclusive (0)           | Modified (1)            | Yes                                 | No                                  |
| 11    | P1: write block 1 |                         |                         |                                     |                                     |
| 12    | P1: read block 1  |                         |                         |                                     |                                     |
| 13    | P0: read block 1  |                         |                         |                                     |                                     |
| 14    | P1: write block 1 |                         |                         |                                     |                                     |

| Cycle | After Operation   | P0 cache block state | P1 cache<br>block state | Memory at block<br>0<br>up to date? | Memory at block<br>1<br>up to date? |
|-------|-------------------|----------------------|-------------------------|-------------------------------------|-------------------------------------|
| 0     | P0: read block 1  | Exclusive (1)        | Invalid                 | Yes                                 | Yes                                 |
| 1     | P1: read block 0  | Exclusive (1)        | Exclusive (0)           | Yes                                 | Yes                                 |
| 2     | P0: write block 1 | Modified (1)         | Exclusive (0)           | Yes                                 | No                                  |
| 3     | P0: write block 0 | Modified (0)         | Invalid                 | No                                  | Yes                                 |
| 4     | P1: read block 0  | Shared (0)           | Shared (0)              | Yes                                 | Yes                                 |
| 5     | P1: write block 0 | Invalid              | Modified (0)            | No                                  | Yes                                 |
| 6     | P0: read block 1  | Exclusive (1)        | Modified (0)            | No                                  | Yes                                 |
| 7     | P1: read block 1  | Shared (1)           | Shared (1)              | Yes                                 | Yes                                 |
| 8     | P0: write block 1 | Modified (1)         | Invalid                 | Yes                                 | No                                  |
| 9     | P1: write block 1 | Invalid              | Modified (1)            | Yes                                 | No                                  |
| 10    | P0: read block 0  | Exclusive (0)        | Modified (1)            | Yes                                 | No                                  |
| 11    | P1: write block 1 | Exclusive (0)        | Modified (1)            | Yes                                 | No                                  |
| 12    | P1: read block 1  |                      |                         |                                     |                                     |
| 13    | P0: read block 1  |                      |                         |                                     |                                     |
| 14    | P1: write block 1 |                      |                         |                                     |                                     |

| Cycle | After Operation   | P0 cache<br>block state | P1 cache<br>block state | Memory at block Memory at block |                  |
|-------|-------------------|-------------------------|-------------------------|---------------------------------|------------------|
|       |                   |                         |                         | 0<br>up to date?                | 1<br>up to date? |
| 0     | P0: read block 1  | Exclusive (1)           | Invalid                 | Yes                             | Yes              |
| 1     | P1: read block 0  | Exclusive (1)           | Exclusive (0)           | Yes                             | Yes              |
| 2     | P0: write block 1 | Modified (1)            | Exclusive (0)           | Yes                             | No               |
| 3     | P0: write block 0 | Modified (0)            | Invalid                 | No                              | Yes              |
| 4     | P1: read block 0  | Shared (0)              | Shared (0)              | Yes                             | Yes              |
| 5     | P1: write block 0 | Invalid                 | Modified (0)            | No                              | Yes              |
| 6     | P0: read block 1  | Exclusive (1)           | Modified (0)            | No                              | Yes              |
| 7     | P1: read block 1  | Shared (1)              | Shared (1)              | Yes                             | Yes              |
| 8     | P0: write block 1 | Modified (1)            | Invalid                 | Yes                             | No               |
| 9     | P1: write block 1 | Invalid                 | Modified (1)            | Yes                             | No               |
| 10    | P0: read block 0  | Exclusive (0)           | Modified (1)            | Yes                             | No               |
| 11    | P1: write block 1 | Exclusive (0)           | Modified (1)            | Yes                             | No               |
| 12    | P1: read block 1  | Exclusive (0)           | Modified (1)            | Yes                             | No               |
| 13    | P0: read block 1  |                         |                         |                                 |                  |
| 14    | P1: write block 1 |                         |                         |                                 |                  |

|       | After Operation   | P0 cache<br>block state | P1 cache<br>block state | Memory at block Memory at block |                  |
|-------|-------------------|-------------------------|-------------------------|---------------------------------|------------------|
| Cycle |                   |                         |                         | 0<br>up to date?                | 1<br>up to date? |
| 0     | P0: read block 1  | Exclusive (1)           | Invalid                 | Yes                             | Yes              |
| 1     | P1: read block 0  | Exclusive (1)           | Exclusive (0)           | Yes                             | Yes              |
| 2     | P0: write block 1 | Modified (1)            | Exclusive (0)           | Yes                             | No               |
| 3     | P0: write block 0 | Modified (0)            | Invalid                 | No                              | Yes              |
| 4     | P1: read block 0  | Shared (0)              | Shared (0)              | Yes                             | Yes              |
| 5     | P1: write block 0 | Invalid                 | Modified (0)            | No                              | Yes              |
| 6     | P0: read block 1  | Exclusive (1)           | Modified (0)            | No                              | Yes              |
| 7     | P1: read block 1  | Shared (1)              | Shared (1)              | Yes                             | Yes              |
| 8     | P0: write block 1 | Modified (1)            | Invalid                 | Yes                             | No               |
| 9     | P1: write block 1 | Invalid                 | Modified (1)            | Yes                             | No               |
| 10    | P0: read block 0  | Exclusive (0)           | Modified (1)            | Yes                             | No               |
| 11    | P1: write block 1 | Exclusive (0)           | Modified (1)            | Yes                             | No               |
| 12    | P1: read block 1  | Exclusive (0)           | Modified (1)            | Yes                             | No               |
| 13    | P0: read block 1  | Shared (1)              | Shared (1)              | Yes                             | Yes              |
| 14    | P1: write block 1 |                         |                         |                                 |                  |


| Cycle | After Operation   | P0 cache<br>block state | P1 cache<br>block state | Memory at block 0<br>up to date? | Memory at block 1<br>up to date? |
|-------|-------------------|-------------------------|-------------------------|----------------------------------|----------------------------------|
| 0     | P0: read block 1  | Exclusive (1)           | Invalid                 | Yes                             | Yes              |
| 1     | P1: read block 0  | Exclusive (1)           | Exclusive (0)           | Yes                             | Yes              |
| 2     | P0: write block 1 | Modified (1)            | Exclusive (0)           | Yes                             | No               |
| 3     | P0: write block 0 | Modified (0)            | Invalid                 | No                              | Yes              |
| 4     | P1: read block 0  | Shared (0)              | Shared (0)              | Yes                             | Yes              |
| 5     | P1: write block 0 | Invalid                 | Modified (0)            | No                              | Yes              |
| 6     | P0: read block 1  | Exclusive (1)           | Modified (0)            | No                              | Yes              |
| 7     | P1: read block 1  | Shared (1)              | Shared (1)              | Yes                             | Yes              |
| 8     | P0: write block 1 | Modified (1)            | Invalid                 | Yes                             | No               |
| 9     | P1: write block 1 | Invalid                 | Modified (1)            | Yes                             | No               |
| 10    | P0: read block 0  | Exclusive (0)           | Modified (1)            | Yes                             | No               |
| 11    | P1: write block 1 | Exclusive (0)           | Modified (1)            | Yes                             | No               |
| 12    | P1: read block 1  | Exclusive (0)           | Modified (1)            | Yes                             | No               |
| 13    | P0: read block 1  | Shared (1)              | Shared (1)              | Yes                             | Yes              |
| 14    | P1: write block 1 | Invalid                 | Modified (1)            | Yes                             | No               |
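The completed table can be cross-checked with a small simulation. The sketch below is only a simplified model and not part of the original exercise: it assumes one-block caches, a write-back of a dirty block both on eviction and on a snooped read, and invalidation on writes. The names `simulate_mesi` and `OPS` are hypothetical.

```python
def simulate_mesi(ops, n_procs=2, n_blocks=2):
    """Replay (processor, 'R' or 'W', block) accesses; return one row per access."""
    caches = [("I", None)] * n_procs      # (MESI state, block held) per processor
    mem_ok = [True] * n_blocks            # is memory up to date for each block?
    rows = []
    for proc, kind, blk in ops:
        state, held = caches[proc]
        hit = state != "I" and held == blk
        if not hit and state == "M":
            mem_ok[held] = True           # dirty victim is written back on eviction
        # Other caches currently holding the requested block.
        sharers = [q for q in range(n_procs)
                   if q != proc and caches[q][0] != "I" and caches[q][1] == blk]
        if kind == "R":
            if not hit:
                for q in sharers:
                    if caches[q][0] == "M":
                        mem_ok[blk] = True    # owner writes back on a snooped read
                    caches[q] = ("S", blk)
                caches[proc] = ("S" if sharers else "E", blk)
        else:
            for q in sharers:
                caches[q] = ("I", None)       # invalidate the other copies
            caches[proc] = ("M", blk)
            mem_ok[blk] = False               # only the writer holds the new value
        labels = [s if s == "I" else f"{s}({b})" for s, b in caches]
        rows.append((labels, list(mem_ok)))
    return rows


# Access pattern of the table above: (processor, operation, block).
OPS = [(0, "R", 1), (1, "R", 0), (0, "W", 1), (0, "W", 0), (1, "R", 0),
       (1, "W", 0), (0, "R", 1), (1, "R", 1), (0, "W", 1), (1, "W", 1),
       (0, "R", 0), (1, "W", 1), (1, "R", 1), (0, "R", 1), (1, "W", 1)]

rows = simulate_mesi(OPS)
for cycle, (states, mem) in enumerate(rows):
    print(cycle, states, mem)
```

Each printed row lists the P0 and P1 block states followed by the two "memory up to date?" flags, in the same order as the table columns.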




Consider the following access pattern on a two-processor system. Each processor has a direct-mapped, write-back cache holding a single cache block, and the memory holds two cache blocks.

Assume the MESI protocol is used, with write-back, write-allocate caches and invalidation of the other caches on a write (instead of updating the value in the other caches).





# Exe 1: Cache Coherency

| Cycle | After Operation   | P0 cache<br>block state       | P1 cache<br>block state | Memory at block 0<br>up to date? | Memory at block 1<br>up to date? |
|-------|-------------------|-------------------------------|---------------|-----------------|-----------------|
| 0     | P0: read block 1  | Exclusive (1)                 | Invalid       | Yes             | Yes             |
| 1     | P1: read block 0  | Exclusive (1)                 | Exclusive (0) | Yes             | Yes             |
| 2     | P0: write block 1 | Modified (1)                  | Exclusive (0) | Yes             | No              |
| 3     | P0: read block 1  | Modified (1)                  | Exclusive (0) | Yes             | No              |
| 4     | P1: read block 0  | Modified (1)                  | Exclusive (0) | Yes             | No              |
| 5     | P0: write block 1 | Modified (1)                  | Exclusive (0) | Yes             | No              |
| 6     | P1: read block 1  | Shared (1)                    | Shared (1)    | Yes             | Yes             |
| 7     | P1: read block 0  | Shared (1)                    | Exclusive (0) | Yes             | Yes             |
| 8     | P0: read block 1  | Shared (1)                    | Shared (1)    | Yes             | Yes             |
| 9     | P0: write block 1 | Modified (1)                  | Invalid (1)   | Yes             | No              |
| 10    | P1: write block 0 | Modified (1)                  | Modified (0)  | No              | No              |
| 11    | P1: read block 1  | Shared (1)                    | Shared (1)    | Yes             | Yes             |










### Recall: MESI State Transition Diagram





The diagram has two parts:

- Cache line in the "acting" processor
- Transaction due to events snooped on the common BUS

#### **BUS Transactions**

- RH = Read Hit
- RMS = Read Miss, Shared
- RME = Read Miss, Exclusive
- WH = Write Hit
- WM = Write Miss
- SHR = Snoop Hit on a Read
- SHW = Snoop Hit on a Write or Read-with-Intent-to-Modify

The diagram also marks three bus actions: Invalidate Transaction, Read-with-Intent-to-Modify, and Cache Block Fill.
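Since the diagram itself is not reproduced here, the standard MESI transitions can be written as a lookup table keyed by the event abbreviations of the legend. This is a minimal sketch of the textbook protocol, not a transcription of the slide; `MESI_NEXT` and `run` are hypothetical names.

```python
# Next state of one cache line, keyed by (current state, event).
# Pairs that cannot occur (e.g. a hit on an Invalid line) are left out on purpose,
# so looking them up raises KeyError.
MESI_NEXT = {
    ("I", "RMS"): "S", ("I", "RME"): "E", ("I", "WM"): "M",
    ("S", "RH"): "S", ("S", "WH"): "M", ("S", "SHR"): "S", ("S", "SHW"): "I",
    ("E", "RH"): "E", ("E", "WH"): "M", ("E", "SHR"): "S", ("E", "SHW"): "I",
    ("M", "RH"): "M", ("M", "WH"): "M", ("M", "SHR"): "S", ("M", "SHW"): "I",
}


def run(events, state="I"):
    """Apply a sequence of events to a line that starts in the given state."""
    for event in events:
        state = MESI_NEXT[(state, event)]
    return state


# Exclusive fill, local write, then another processor reads the line.
print(run(["RME", "WH", "SHR"]))
```

A line in Modified that sees SHR or SHW also has to write its data back before changing state; the table only tracks the state, not that bus action.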





#### Recall

# States of cache lines with MESI

|                               | Modified             | Exclusive         | Shared                                       | Invalid                     |
|-------------------------------|----------------------|-------------------|----------------------------------------------|-----------------------------|
| Line valid?                   | Yes                  | Yes               | Yes                                          | No                          |
| Copy in memory                | Has to be<br>updated | Valid             | Valid                                        | -                           |
| Other copies in other caches? | No                   | No                | Maybe                                        | Maybe                       |
| A write on this line          | Does not access<br>the BUS | Does not access<br>the BUS | Access the<br>BUS and<br>update the<br>cache | Direct access<br>to the BUS |







#### The Context: ILP limits



#### Classes of dependencies:

- Name Dependencies → WAR/WAW
- Data Dependencies → RAW
- Control Dependencies → Conditional Branches










# Explicit Register Renaming

Original code (left) and the same code after renaming onto the physical register file (right):

```
ld   x1, (x3)          ld   P1, (Px)
addi x3, x1, #4        addi P2, P1, #4
sub  x6, x7, x9        sub  P3, Py, Pz
add  x3, x3, x6        add  P4, P2, P3
ld   x6, (x1)          ld   P5, (P1)
add  x6, x6, x3        add  P6, P5, P4
sd   x6, (x1)          sd   P6, (P1)
ld   x6, (x11)         ld   P7, (Pw)
```

Physical RF: 64 registers (P0 to P63), 32 bits each.
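The renaming above can be reproduced mechanically with a map table and a free list. The following is a minimal sketch under stated assumptions: the initial mappings `Px`, `Py`, `Pz`, `Pw` are the placeholders used on the slide, immediates are omitted, and the function name `rename` is hypothetical.

```python
def rename(program, initial_map, free_list):
    """Rename (op, dest, sources) tuples; returns the renamed code and final map."""
    table = dict(initial_map)      # architectural register -> physical register
    free = list(free_list)
    renamed = []
    for op, dest, srcs in program:
        # Sources are looked up first, so "add x3, x3, x6" reads the old x3.
        phys_srcs = [table[s] for s in srcs]
        phys_dest = None
        if dest is not None:       # stores have no destination register
            phys_dest = free.pop(0)
            table[dest] = phys_dest
        renamed.append((op, phys_dest, phys_srcs))
    return renamed, table


program = [
    ("ld",   "x1", ["x3"]),
    ("addi", "x3", ["x1"]),        # immediate #4 omitted
    ("sub",  "x6", ["x7", "x9"]),
    ("add",  "x3", ["x3", "x6"]),
    ("ld",   "x6", ["x1"]),
    ("add",  "x6", ["x6", "x3"]),
    ("sd",   None, ["x6", "x1"]),
    ("ld",   "x6", ["x11"]),
]
initial = {"x3": "Px", "x7": "Py", "x9": "Pz", "x11": "Pw"}
renamed, final_map = rename(program, initial, [f"P{i}" for i in range(1, 64)])
for op, dest, srcs in renamed:
    print(op, dest, srcs)
```

Every write gets a fresh physical register, which is why the three writes to `x6` end up in three different registers and no WAR or WAW dependency survives.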








## EXTRA: What is the problem?



When can we reuse a physical register? a.k.a. Physical Registers Lifetime

When the next writer of the same architectural register commits (the policy used by most current designs, see https://doi.org/10.1109/HPCA56546.2023.10071122).
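A minimal sketch of this release policy, assuming a simple in-order commit and a free list kept as a Python list; the class `RenameUnit` and its methods are hypothetical names used only for illustration.

```python
class RenameUnit:
    def __init__(self, arch_regs, n_phys):
        # Initially architectural register i maps to physical register Pi.
        self.table = {r: f"P{i}" for i, r in enumerate(arch_regs)}
        self.free = [f"P{i}" for i in range(len(arch_regs), n_phys)]

    def rename_dest(self, arch_reg):
        """Allocate a new physical register; return it with the previous mapping."""
        new = self.free.pop(0)
        old = self.table[arch_reg]
        self.table[arch_reg] = new
        return new, old            # 'old' must stay allocated until commit

    def commit(self, old):
        """The next writer has committed: its predecessor's register is reusable."""
        self.free.append(old)


unit = RenameUnit(["x1", "x2"], n_phys=4)
new, old = unit.rename_dest("x1")      # a new writer of x1 is renamed
free_before_commit = list(unit.free)   # the old register is not yet free
unit.commit(old)                       # the writer commits, the old register is released
print(new, old, free_before_commit, unit.free)
```

The old register cannot be released at rename time because older in-flight instructions may still read it, and because the mapping has to be restorable if the new writer is squashed.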





### Exe 3 Scoreboard



Parallel Operation in the Control Data 6600





# Recall: the Scoreboard pipeline

| ISSUE                                  | READ OPERAND                      | EXE COMPLETE                     | WB                                                                                |
|----------------------------------------|-----------------------------------|----------------------------------|-----------------------------------------------------------------------------------|
| Decode<br>instruction;                 | Read operands;                    | Operate on operands;             | Finish exec;                                                                      |
| Structural FUs<br>check;<br>WAW checks | RAW check;<br>WAR if need to read | Notify Scoreboard on completion; | WAR & Struct check<br>(FUs will hold results);<br>Can overlap<br>issue/read&write |





# Recall: the Scoreboard pipeline with Register Renaming

| ISSUE                                                           | READ OPERAND                      | EXE COMPLETE                     | WB                                                                                |
|-----------------------------------------------------------------|-----------------------------------|----------------------------------|-----------------------------------------------------------------------------------|
| Decode instruction; allocate new physical register for result   | Read operands;                    | Operate on operands;             | Finish exec;                                                                      |
| Structural FUs check; free physical registers check             | RAW check                         | Notify Scoreboard on completion; | Struct check<br>(FUs will hold results);<br>Can overlap<br>issue/read&write       |

No checks for WAR or WAW hazards!





#### Exe 3 Scoreboard: the Code

I1: LD F6 32+ R2

I2: ADDD F2 F6 F4

I3: MULTD F0 F4 F2

I4: SUBD F12 F2 F6

I5: ADDD F0 F12 F2

Dependencies in the code:

- RAW F6 I1-I2
- RAW F6 I1-I4
- RAW F2 I2-I3
- RAW F2 I2-I4
- RAW F2 I2-I5
- RAW F12 I4-I5
- WAW F0 I3-I5
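The dependency list can be derived with a pairwise comparison of destination and source registers. This is a minimal sketch with a hypothetical helper `find_hazards`; it compares every pair of instructions and ignores intervening redefinitions, which is sufficient for this short sequence but would over-report on longer code.

```python
def find_hazards(program):
    """program: list of (dest, sources). Returns (kind, register, i, j) tuples."""
    hazards = []
    for i, (dest_i, srcs_i) in enumerate(program):
        for j in range(i + 1, len(program)):
            dest_j, srcs_j = program[j]
            if dest_i in srcs_j:       # later instruction reads an earlier result
                hazards.append(("RAW", dest_i, i + 1, j + 1))
            if dest_j in srcs_i:       # later instruction overwrites an earlier source
                hazards.append(("WAR", dest_j, i + 1, j + 1))
            if dest_i == dest_j:       # both instructions write the same register
                hazards.append(("WAW", dest_i, i + 1, j + 1))
    return hazards


program = [
    ("F6",  ["R2"]),           # I1: LD    F6  32+ R2
    ("F2",  ["F6", "F4"]),     # I2: ADDD  F2  F6  F4
    ("F0",  ["F4", "F2"]),     # I3: MULTD F0  F4  F2
    ("F12", ["F2", "F6"]),     # I4: SUBD  F12 F2  F6
    ("F0",  ["F12", "F2"]),    # I5: ADDD  F0  F12 F2
]
hazards = find_hazards(program)
for kind, reg, i, j in hazards:
    print(f"{kind} {reg} I{i}-I{j}")
```

For this code the function reports six RAW dependencies and one WAW dependency, and no WAR.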





|    | Instruction    | ISSUE | READ<br>OPERAND | EXE<br>COMPLETE | WB | Hazards | Unit |
|----|----------------|-------|-----------------|-----------------|----|---------|------|
| I1 | LD F6 32+ R2   |       |                 |                 |    |         |      |
| I2 | ADDD F2 F6 F4  |       |                 |                 |    |         |      |
| I3 | MULTD F0 F4 F2 |       |                 |                 |    |         |      |
| I4 | SUBD F12 F2 F6 |       |                 |                 |    |         |      |
| I5 | ADDD F0 F12 F2 |       |                 |                 |    |         |      |

| F0 | F2 | F4 | F6 | F8 | F10 | F12 | … | F30 |
|----|----|----|----|----|-----|-----|---|-----|
| P0 | P2 | P4 | P6 | P8 | P10 | P12 | … | P30 |

Initialized rename table: registers from P32 onward are in the free list.

4 FP ALUs with 3 cc latency, single write port for the pool; 1 MEM unit with 2 cc latency




|    | Instruction    | ISSUE | READ<br>OPERAND | EXE<br>COMPLETE | WB | Hazards | Unit |
|----|----------------|-------|-----------------|-----------------|----|---------|------|
| I1 | LD F6 32+ R2   | 1     |                 |                 |    |         | MU   |
| I2 | ADDD F2 F6 F4  |       |                 |                 |    |         |      |
| I3 | MULTD F0 F4 F2 |       |                 |                 |    |         |      |
| I4 | SUBD F12 F2 F6 |       |                 |                 |    |         |      |
| I5 | ADDD F0 F12 F2 |       |                 |                 |    |         |      |

| F0 | F2 | F4 | F6  | F8 | F10 | F12 | … | F30 |
|----|----|----|-----|----|-----|-----|---|-----|
| P0 | P2 | P4 | P32 | P8 | P10 | P12 | … | P30 |

4 FP ALUs with 3 cc latency, **single write port** for the pool; 1 MEM unit with 2 cc latency






|    | Instruction    | ISSUE | READ<br>OPERAND | EXE<br>COMPLETE | WB | Hazards | Unit |
|----|----------------|-------|-----------------|-----------------|----|---------|------|
| I1 | LD F6 32+ R2   | 1     | 2               |                 |    |         | MU   |
| I2 | ADDD F2 F6 F4  | 2     |                 |                 |    |         | FPU1 |
| I3 | MULTD F0 F4 F2 |       |                 |                 |    |         |      |
| I4 | SUBD F12 F2 F6 |       |                 |                 |    |         |      |
| I5 | ADDD F0 F12 F2 |       |                 |                 |    |         |      |

| F0 | F2  | F4 | F6  | F8 | F10 | F12 | … | F30 |
|----|-----|----|-----|----|-----|-----|---|-----|
| P0 | P33 | P4 | P32 | P8 | P10 | P12 | … | P30 |

4 FP ALUs with 3 cc latency, **single write port** for the pool; 1 MEM unit with 2 cc latency






|    | Instruction    | ISSUE | READ<br>OPERAND | EXE<br>COMPLETE | WB | Hazards | Unit |
|----|----------------|-------|-----------------|-----------------|----|---------|------|
| I1 | LD F6 32+ R2   | 1     | 2               |                 |    |         | MU   |
| I2 | ADDD F2 F6 F4  | 2     |                 |                 |    | RAW F6  | FPU1 |
| I3 | MULTD F0 F4 F2 | 3     |                 |                 |    |         | FPU2 |
| I4 | SUBD F12 F2 F6 |       |                 |                 |    |         |      |
| I5 | ADDD F0 F12 F2 |       |                 |                 |    |         |      |

| F0  | F2  | F4 | F6  | F8 | F10 | F12 | … | F30 |
|-----|-----|----|-----|----|-----|-----|---|-----|
| P34 | P33 | P4 | P32 | P8 | P10 | P12 | … | P30 |

4 FP ALUs with 3 cc latency, **single write port** for the pool; 1 MEM unit with 2 cc latency








|    | Instruction    | ISSUE | READ<br>OPERAND | EXE<br>COMPLETE | WB | Hazards | Unit |
|----|----------------|-------|-----------------|-----------------|----|---------|------|
| I1 | LD F6 32+ R2   | 1     | 2               | 4               |    |         | MU   |
| I2 | ADDD F2 F6 F4  | 2     |                 |                 |    | RAW F6  | FPU1 |
| I3 | MULTD F0 F4 F2 | 3     |                 |                 |    | RAW F2  | FPU2 |
| I4 | SUBD F12 F2 F6 | 4     |                 |                 |    |         | FPU3 |
| I5 | ADDD F0 F12 F2 |       |                 |                 |    |         |      |

| F0  | F2  | F4 | F6  | F8 | F10 | F12 | … | F30 |
|-----|-----|----|-----|----|-----|-----|---|-----|
| P34 | P33 | P4 | P32 | P8 | P10 | P35 | … | P30 |

4 FP ALUs with 3 cc latency, **single write port** for the pool; 1 MEM unit with 2 cc latency













|    | Instruction    | ISSUE | READ<br>OPERAND | EXE<br>COMPLETE | WB | Hazards    | Unit |
|----|----------------|-------|-----------------|-----------------|----|------------|------|
| I1 | LD F6 32+ R2   | 1     | 2               | 4               | 5  |            | MU   |
| I2 | ADDD F2 F6 F4  | 2     |                 |                 |    | RAW F6     | FPU1 |
| I3 | MULTD F0 F4 F2 | 3     |                 |                 |    | RAW F2     | FPU2 |
| I4 | SUBD F12 F2 F6 | 4     |                 |                 |    | RAW F2, F6 | FPU3 |
| I5 | ADDD F0 F12 F2 | 5     |                 |                 |    |            | FPU4 |

| F0  | F2  | F4 | F6  | F8 | F10 | F12 | … | F30 |
|-----|-----|----|-----|----|-----|-----|---|-----|
| P36 | P33 | P4 | P32 | P8 | P10 | P35 | … | P30 |

4 FP ALUs with 3 cc latency, **single write port** for the pool; 1 MEM unit with 2 cc latency

WAW F0 I3-I5 is solved by register renaming (even though it is not listed in the rename table).
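The evolution of the rename table during issue can be replayed in a few lines. This is only a sketch: the exercise states that registers from P32 are free, while the upper bound of 64 physical registers is an assumption made here, and the variable names are hypothetical.

```python
# Rename table initialised as in the exercise: F0 -> P0, F2 -> P2, ..., F30 -> P30.
rename_table = {f"F{i}": f"P{i}" for i in range(0, 32, 2)}
free_list = [f"P{i}" for i in range(32, 64)]     # assumed size of the free pool

destinations = ["F6", "F2", "F0", "F12", "F0"]   # destinations of I1..I5, in issue order
history = []
for dest in destinations:
    rename_table[dest] = free_list.pop(0)        # fresh physical register per write
    history.append(dict(rename_table))           # snapshot after each issue

for cycle, snapshot in enumerate(history, start=1):
    print(cycle, snapshot["F0"], snapshot["F2"], snapshot["F6"], snapshot["F12"])
```

I3 and I5 both write F0 but receive P34 and P36 respectively, so their results never target the same physical register. After I5 issues, only P36 appears in the table under F0.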






|    | Instruction    | ISSUE | READ<br>OPERAND | EXE<br>COMPLETE | WB | Hazards     | Unit |
|----|----------------|-------|-----------------|-----------------|----|-------------|------|
| I1 | LD F6 32+ R2   | 1     | 2               | 4               | 5  |             | MU   |
| I2 | ADDD F2 F6 F4  | 2     | 6               |                 |    | RAW F6      | FPU1 |
| I3 | MULTD F0 F4 F2 | 3     |                 |                 |    | RAW F2      | FPU2 |
| I4 | SUBD F12 F2 F6 | 4     |                 |                 |    | RAW F2, F6  | FPU3 |
| I5 | ADDD F0 F12 F2 | 5     |                 |                 |    | RAW F2, F12 | FPU4 |

| F0  | F2  | F4 | F6  | F8 | F10 | F12 | … | F30 |
|-----|-----|----|-----|----|-----|-----|---|-----|
| P36 | P33 | P4 | P32 | P8 | P10 | P35 | … | P30 |

4 FP ALUs with 3 cc latency, **single write port** for the pool; 1 MEM unit with 2 cc latency






|    | Instruction    | ISSUE | READ<br>OPERAND | EXE<br>COMPLETE | WB | Hazards     | Unit |
|----|----------------|-------|-----------------|-----------------|----|-------------|------|
| I1 | LD F6 32+ R2   | 1     | 2               | 4               | 5  |             | MU   |
| I2 | ADDD F2 F6 F4  | 2     | 6               | 9               |    | RAW F6      | FPU1 |
| I3 | MULTD F0 F4 F2 | 3     |                 |                 |    | RAW F2      | FPU2 |
| I4 | SUBD F12 F2 F6 | 4     |                 |                 |    | RAW F2, F6  | FPU3 |
| I5 | ADDD F0 F12 F2 | 5     |                 |                 |    | RAW F2, F12 | FPU4 |

| F0  | F2  | F4 | F6  | F8 | F10 | F12 | … | F30 |
|-----|-----|----|-----|----|-----|-----|---|-----|
| P36 | P33 | P4 | P32 | P8 | P10 | P35 | … | P30 |

4 FP ALUs with 3 cc latency, **single write port** for the pool; 1 MEM unit with 2 cc latency






|    | Instruction    | ISSUE | READ<br>OPERAND | EXE<br>COMPLETE | WB | Hazards     | Unit |
|----|----------------|-------|-----------------|-----------------|----|-------------|------|
| I1 | LD F6 32+ R2   | 1     | 2               | 4               | 5  |             | MU   |
| I2 | ADDD F2 F6 F4  | 2     | 6               | 9               | 10 | RAW F6      | FPU1 |
| I3 | MULTD F0 F4 F2 | 3     |                 |                 |    | RAW F2      | FPU2 |
| I4 | SUBD F12 F2 F6 | 4     |                 |                 |    | RAW F2, F6  | FPU3 |
| I5 | ADDD F0 F12 F2 | 5     |                 |                 |    | RAW F2, F12 | FPU4 |

| F0  | F2  | F4 | F6  | F8 | F10 | F12 | … | F30 |
|-----|-----|----|-----|----|-----|-----|---|-----|
| P36 | P33 | P4 | P32 | P8 | P10 | P35 | … | P30 |

4 FP ALUs with 3 cc latency, **single write port** for the pool; 1 MEM unit with 2 cc latency







|    | Instruction    | ISSUE | READ<br>OPERAND | EXE<br>COMPLETE | WB | Hazards     | Unit |
|----|----------------|-------|-----------------|-----------------|----|-------------|------|
| I1 | LD F6 32+ R2   | 1     | 2               | 4               | 5  |             | MU   |
| I2 | ADDD F2 F6 F4  | 2     | 6               | 9               | 10 | RAW F6      | FPU1 |
| I3 | MULTD F0 F4 F2 | 3     | 11              |                 |    | RAW F2      | FPU2 |
| I4 | SUBD F12 F2 F6 | 4     | 11              |                 |    | RAW F2, F6  | FPU3 |
| I5 | ADDD F0 F12 F2 | 5     |                 |                 |    | RAW F2, F12 | FPU4 |

| F0  | F2  | F4 | F6  | F8 | F10 | F12 | ... F30 |
|-----|-----|----|-----|----|-----|-----|---------|
| P36 | P33 | P4 | P32 | P8 | P10 | P35 | ... P30 |

4 FPALU 3 cc latency, <u>single write</u> port for the pool 1 MEM 2 cc latency

RAW F6 I1-I2

RAW F6 I1-I4

RAW F2 I2-I3

RAW F2 I2-I4

RAW F2 I2-I5

RAW F12 I4-I5





|    | Instruction    | ISSUE | READ<br>OPERAND | EXE<br>COMPLETE | WB | Hazards     | Unit |
|----|----------------|-------|-----------------|-----------------|----|-------------|------|
| I1 | LD F6 32+ R2   | 1     | 2               | 4               | 5  |             | MU   |
| I2 | ADDD F2 F6 F4  | 2     | 6               | 9               | 10 | RAW F6      | FPU1 |
| I3 | MULTD F0 F4 F2 | 3     | 11              | 14              |    | RAW F2      | FPU2 |
| I4 | SUBD F12 F2 F6 | 4     | 11              | 14              |    | RAW F2, F6  | FPU3 |
| I5 | ADDD F0 F12 F2 | 5     |                 |                 |    | RAW F2, F12 | FPU4 |

| F0  | F2  | F4 | F6  | F8 | F10 | F12 | ... F30 |
|-----|-----|----|-----|----|-----|-----|---------|
| P36 | P33 | P4 | P32 | P8 | P10 | P35 | ... P30 |

4 FPALU 3 cc latency, <u>single write</u> port for the pool 1 MEM 2 cc latency







|    | Instruction    | ISSUE | READ<br>OPERAND | EXE<br>COMPLETE | WB | Hazards     | Unit |
|----|----------------|-------|-----------------|-----------------|----|-------------|------|
| I1 | LD F6 32+ R2   | 1     | 2               | 4               | 5  |             | MU   |
| I2 | ADDD F2 F6 F4  | 2     | 6               | 9               | 10 | RAW F6      | FPU1 |
| I3 | MULTD F0 F4 F2 | 3     | 11              | 14              |    | RAW F2      | FPU2 |
| I4 | SUBD F12 F2 F6 | 4     | 11              | 14              |    | RAW F2, F6  | FPU3 |
| I5 | ADDD F0 F12 F2 | 5     |                 |                 |    | RAW F2, F12 | FPU4 |

| F0  | F2  | F4 | F6  | F8 | F10 | F12 | ... F30 |
|-----|-----|----|-----|----|-----|-----|---------|
| P36 | P33 | P4 | P32 | P8 | P10 | P35 | ... P30 |

4 FPALU 3 cc latency, <u>single write</u> port for the pool 1 MEM 2 cc latency

RAW F6 I1-I2

RAW F6 I1-I4

RAW F2 I2-I3

RAW F2 I2-I4

RAW F2 I2-I5

RAW F12 I4-I5





#### Exe 3.3 Scoreboard



|    | Instruction    | ISSUE | READ<br>OPERAND | EXE<br>COMPLETE | WB | Hazards                   | Unit |
|----|----------------|-------|-----------------|-----------------|----|---------------------------|------|
| I1 | LD F6 32+ R2   | 1     | 2               | 4               | 5  |                           | MU   |
| I2 | ADDD F2 F6 F4  | 2     | 6               | 9               | 10 | RAW F6                    | FPU1 |
| I3 | MULTD F0 F4 F2 | 3     | 11              | 14              | 15 | RAW F2                    | FPU2 |
| I4 | SUBD F12 F2 F6 | 4     | 11              | 14              |    | RAW F2, F6 +<br>Struct RF | FPU3 |
| I5 | ADDD F0 F12 F2 | 5     |                 |                 |    | RAW F2, F12               | FPU4 |

| F0  | F2  | F4 | F6  | F8 | F10 | F12 | ... F30 |
|-----|-----|----|-----|----|-----|-----|---------|
| P36 | P33 | P4 | P32 | P8 | P10 | P35 | ... P30 |

4 FPALU 3 cc latency, <u>single write</u> port for the pool 1 MEM 2 cc latency







|    | Instruction    | ISSUE | READ<br>OPERAND | EXE<br>COMPLETE | WB | Hazards                   | Unit |
|----|----------------|-------|-----------------|-----------------|----|---------------------------|------|
| I1 | LD F6 32+ R2   | 1     | 2               | 4               | 5  |                           | MU   |
| I2 | ADDD F2 F6 F4  | 2     | 6               | 9               | 10 | RAW F6                    | FPU1 |
| I3 | MULTD F0 F4 F2 | 3     | 11              | 14              | 15 | RAW F2                    | FPU2 |
| I4 | SUBD F12 F2 F6 | 4     | 11              | 14              | 16 | RAW F2, F6 +<br>Struct RF | FPU3 |
| I5 | ADDD F0 F12 F2 | 5     |                 |                 |    | RAW F2, F12               | FPU4 |

| F0  | F2  | F4 | F6  | F8 | F10 | F12 | ... F30 |
|-----|-----|----|-----|----|-----|-----|---------|
| P36 | P33 | P4 | P32 | P8 | P10 | P35 | ... P30 |

4 FPALU 3 cc latency, <u>single write</u> port for the pool 1 MEM 2 cc latency







|    | Instruction    | ISSUE | READ<br>OPERAND | EXE<br>COMPLETE | WB | Hazards                   | Unit |
|----|----------------|-------|-----------------|-----------------|----|---------------------------|------|
| I1 | LD F6 32+ R2   | 1     | 2               | 4               | 5  |                           | MU   |
| I2 | ADDD F2 F6 F4  | 2     | 6               | 9               | 10 | RAW F6                    | FPU1 |
| I3 | MULTD F0 F4 F2 | 3     | 11              | 14              | 15 | RAW F2                    | FPU2 |
| I4 | SUBD F12 F2 F6 | 4     | 11              | 14              | 16 | RAW F2, F6 +<br>Struct RF | FPU3 |
| I5 | ADDD F0 F12 F2 | 5     | 17              |                 |    | RAW F2, F12               | FPU4 |

| F0  | F2  | F4 | F6  | F8 | F10 | F12 | ... F30 |
|-----|-----|----|-----|----|-----|-----|---------|
| P36 | P33 | P4 | P32 | P8 | P10 | P35 | ... P30 |

4 FPALU 3 cc latency, <u>single write</u> port for the pool 1 MEM 2 cc latency

RAW F6 I1-I2
RAW F6 I1-I4
RAW F2 I2-I3
RAW F2 I2-I4
RAW F2 I2-I5
RAW F12 I4-I5
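
The RAW list above can be derived mechanically by tracking the most recent writer of each register. The following is a minimal sketch; the tuple encoding of the instructions and the `raw_hazards` helper are illustrative assumptions, not part of the exercise text.

```python
# Each instruction: (label, destination, source registers). R2 is an integer
# base register and is ignored here because no instruction writes it.
program = [
    ("I1", "F6",  []),             # LD    F6  32+R2
    ("I2", "F2",  ["F6", "F4"]),   # ADDD  F2  F6  F4
    ("I3", "F0",  ["F4", "F2"]),   # MULTD F0  F4  F2
    ("I4", "F12", ["F2", "F6"]),   # SUBD  F12 F2  F6
    ("I5", "F0",  ["F12", "F2"]),  # ADDD  F0  F12 F2
]

def raw_hazards(program):
    """Return (register, producer, consumer) for every true dependence."""
    hazards = []
    last_writer = {}  # register -> label of its most recent writer
    for label, dest, sources in program:
        for src in sources:
            if src in last_writer:
                hazards.append((src, last_writer[src], label))
        last_writer[dest] = label
    return hazards

for reg, producer, consumer in raw_hazards(program):
    print(f"RAW {reg} {producer}-{consumer}")
```

The sketch reports the same six dependences, although in program order of the consumer rather than grouped by register.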





|    | Instruction    | ISSUE | READ<br>OPERAND | EXE<br>COMPLETE | WB | Hazards                   | Unit |
|----|----------------|-------|-----------------|-----------------|----|---------------------------|------|
| I1 | LD F6 32+ R2   | 1     | 2               | 4               | 5  |                           | MU   |
| I2 | ADDD F2 F6 F4  | 2     | 6               | 9               | 10 | RAW F6                    | FPU1 |
| I3 | MULTD F0 F4 F2 | 3     | 11              | 14              | 15 | RAW F2                    | FPU2 |
| I4 | SUBD F12 F2 F6 | 4     | 11              | 14              | 16 | RAW F2, F6 +<br>Struct RF | FPU3 |
| I5 | ADDD F0 F12 F2 | 5     | 17              | 20              | 21 | RAW F2, F12               | FPU4 |

| F0  | F2  | F4 | F6  | F8 | F10 | F12 | ... F30 |
|-----|-----|----|-----|----|-----|-----|---------|
| P36 | P33 | P4 | P32 | P8 | P10 | P35 | ... P30 |

4 FPALU 3 cc latency, <u>single write</u> port for the pool 1 MEM 2 cc latency
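
The cycle numbers in the completed table can be cross-checked with a small timing sketch. It is a simplification under stated assumptions: one instruction issues per cycle, the four FP units never cause a structural stall, operands are read the cycle after the producer writes back, and the single write port is granted in program order. The `schedule` function and the tuple encoding are hypothetical.

```python
LATENCY = {"MU": 2, "FPU": 3}  # cycles, from the exercise text

# (label, unit kind, renamed destination, renamed sources)
program = [
    ("I1", "MU",  "P32", []),
    ("I2", "FPU", "P33", ["P32", "P4"]),
    ("I3", "FPU", "P34", ["P4", "P33"]),
    ("I4", "FPU", "P35", ["P33", "P32"]),
    ("I5", "FPU", "P36", ["P35", "P33"]),
]

def schedule(program):
    ready = {}       # physical register -> WB cycle of its producer
    wb_used = set()  # cycles in which the single write port is taken
    rows = []
    for n, (label, unit, dest, sources) in enumerate(program, start=1):
        issue = n  # assumption: one issue per cycle, units always available
        # Operands are readable the cycle after the last producer writes back.
        read = max([issue + 1] + [ready[s] + 1 for s in sources if s in ready])
        exe = read + LATENCY[unit]
        wb = exe + 1
        while wb in wb_used:  # single write port: wait for a free cycle
            wb += 1
        wb_used.add(wb)
        ready[dest] = wb
        rows.append((label, issue, read, exe, wb))
    return rows

for row in schedule(program):
    print(*row)
```

Under these assumptions I4 finishes execution together with I3 at cycle 14 but writes back one cycle later, which is the structural hazard on the register file noted in the table.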









# Thank you for your attention

Questions?

Alessandro Verosimile <alessandro.verosimile@polimi.it>

#### Acknowledgements

Davide Conficconi, E. Del Sozzo, Marco D. Santambrogio, D. Sciuto

Part of this material comes from:

- "Computer Organization and Design" and "Computer Architecture: A Quantitative Approach", books by Patterson and Hennessy
- News and papers cited throughout the lecture

and is the *property of its respective owners*





#### Problems:

- Scheduled loops require lots of registers,
- Lots of duplicated code in prolog, epilog

#### Solution:

Allocate new set of registers for each loop iteration

- Rotating Register Base (RRB) register points to base of current register set.
- Value added on to logical register specifier to give physical register number.



 Usually, split into rotating and non-rotating registers.
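
The mapping described above can be sketched in a few lines. The register file size of 32 and the wrap-around are assumptions made for illustration, and `physical` is a hypothetical helper name.

```python
def physical(logical, rrb, size=32):
    """Physical index of a rotating register: RRB plus specifier, wrapped."""
    return (rrb + logical) % size

# The same logical register f1 lands in a different physical register
# each time RRB is decremented by the loop-closing branch.
for rrb in (8, 7, 6):
    print(f"RRB={rrb}: f1 -> P{physical(1, rrb)}")
```

Since RRB changes on every iteration, each iteration sees its own copy of `f1` without any change to the instruction encoding.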

# Loop Example





for (i=0; i<N; i++)

B[i] = A[i] + C;

No FP dependences between consecutive iterations







Int1 Int2 M1 M2 FP+ FPx



| ld f1, () | fadd fx, fy, ... | sd fz, () | bloop |
|-----------|------------------|-----------|-------|

Three cycle load latency encoded as difference of 3 in register specifier number (ld writes f1, fadd reads fy: y = 1 + 3 = 4)

Four cycle fadd latency encoded as difference of 4 in register specifier number (fadd writes f5, sd reads fz: z = 5 + 4 = 9)

| ld f1, () | fadd f5, f4, ... | sd f9, () | bloop |
|-----------|------------------|-----------|-------|

loop: ld f1, ...
...
fadd f2, f0, f1
sd f2, ...
...


| ld f1, () | fadd f5, f4,   | sd f9, ()  | bloop |
|-----------|----------------|------------|-------|
| ld P9, () | fadd P13, P12, | sd P17, () | bloop |

RRB=8

Three cycle load latency encoded as difference of 3 in register specifier number (f1 + 3 = fy... y= 3+1)

Four cycle fadd latency encoded as difference of 4 in register specifier number (f5 + 4= fz... z=5+4)

| ld f1, () | fadd f5, f4,   | sd f9, ()  | bloop |       |
|-----------|----------------|------------|-------|-------|
| ld P9, () | fadd P13, P12, | sd P17, () | bloop | RRB=8 |
| ld P8, () | fadd P12, P11, | sd P16, () | bloop | RRB=7 |

Three cycle load latency encoded as difference of 3 in register specifier number

(f1 + 3 = fy... y= 3+1)

| ld f1, () | fadd f5, f4,   | sd f9, ()  | bloop |       |
|-----------|----------------|------------|-------|-------|
| ld P9, () | fadd P13, P12, | sd P17, () | bloop | RRB=8 |
| ld P8, () | fadd P12, P11, | sd P16, () | bloop | RRB=7 |
| ld P7, () | fadd P11, P10, | sd P15, () | bloop | RRB=6 |

Three cycle load latency encoded as difference of 3 in register specifier number

(f1 + 3 = fy... y= 3+1)

| ld f1, () | fadd f5, f4,   | sd f9, ()  | bloop |       |
|-----------|----------------|------------|-------|-------|
| ld P9, () | fadd P13, P12, | sd P17, () | bloop | RRB=8 |
| ld P8, () | fadd P12, P11, | sd P16, () | bloop | RRB=7 |
| ld P7, () | fadd P11, P10, | sd P15, () | bloop | RRB=6 |
| ld P6, () | fadd P10, P9,  | sd P14, () | bloop | RRB=5 |


#### Rotating Register File

Three cycle load latency encoded as difference of 3 in register specifier number (f1 + 3 = fy... y = 3+1)

Four cycle fadd latency encoded as difference of 4 in register specifier number (f5 + 4 = fz... z = 5 + 4)

| ld f1, () | fadd f5, f4,   | sd f9, ()  | bloop |       |
|-----------|----------------|------------|-------|-------|
| ld P9, () | fadd P13, P12, | sd P17, () | bloop | RRB=8 |
| ld P8, () | fadd P12, P11, | sd P16, () | bloop | RRB=7 |
| ld P7, () | fadd P11, P10, | sd P15, () | bloop | RRB=6 |
| ld P6, () | fadd P10, P9,  | sd P14, () | bloop | RRB=5 |
| ld P5, () | fadd P9, P8,   | sd P13, () | bloop | RRB=4 |

Three cycle load latency encoded as difference of 3 in register specifier number (f1 + 3 = fy... y= 3+1)

| ld f1, () | fadd f5, f4,   | sd f9, ()  | bloop |       |
|-----------|----------------|------------|-------|-------|
| ld P9, () | fadd P13, P12, | sd P17, () | bloop | RRB=8 |
| ld P8, () | fadd P12, P11, | sd P16, () | bloop | RRB=7 |
| ld P7, () | fadd P11, P10, | sd P15, () | bloop | RRB=6 |
| ld P6, () | fadd P10, P9,  | sd P14, () | bloop | RRB=5 |
| ld P5, () | fadd P9, P8,   | sd P13, () | bloop | RRB=4 |
| ld P4, () | fadd P8, P7,   | sd P12, () | bloop | RRB=3 |

Three cycle load latency encoded as difference of 3 in register specifier number

(f1 + 3 = fy... y= 3+1)

| ld f1, () | fadd f5, f4,   | sd f9, ()  | bloop |       |
|-----------|----------------|------------|-------|-------|
| ld P9, () | fadd P13, P12, | sd P17, () | bloop | RRB=8 |
| ld P8, () | fadd P12, P11, | sd P16, () | bloop | RRB=7 |
| ld P7, () | fadd P11, P10, | sd P15, () | bloop | RRB=6 |
| ld P6, () | fadd P10, P9,  | sd P14, () | bloop | RRB=5 |
| ld P5, () | fadd P9, P8,   | sd P13, () | bloop | RRB=4 |
| ld P4, () | fadd P8, P7,   | sd P12, () | bloop | RRB=3 |
| ld P3, () | fadd P7, P6,   | sd P11, () | bloop | RRB=2 |

Three cycle load latency encoded as difference of 3 in register specifier number (f1 + 3 = fy... y= 3+1)

| ld f1, () | fadd f5, f4,   | sd f9, ()  | bloop |       |
|-----------|----------------|------------|-------|-------|
| ld P9, () | fadd P13, P12, | sd P17, () | bloop | RRB=8 |
| ld P8, () | fadd P12, P11, | sd P16, () | bloop | RRB=7 |
| ld P7, () | fadd P11, P10, | sd P15, () | bloop | RRB=6 |
| ld P6, () | fadd P10, P9,  | sd P14, () | bloop | RRB=5 |
| ld P5, () | fadd P9, P8,   | sd P13, () | bloop | RRB=4 |
| ld P4, () | fadd P8, P7,   | sd P12, () | bloop | RRB=3 |
| ld P3, () | fadd P7, P6,   | sd P11, () | bloop | RRB=2 |
| ld P2, () | fadd P6, P5,   | sd P10, () | bloop | RRB=1 |
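
The table can be checked against the latency encoding with a short sketch. It assumes one kernel iteration per cycle (so a latency of n cycles equals n iterations) and ignores wrap-around of the register file; the constant names and the `row` helper are hypothetical.

```python
LOAD_LATENCY = 3  # iterations between ld and the fadd that consumes its value
FADD_LATENCY = 4  # iterations between fadd and the sd that consumes its value

LD_DEST = 1                        # ld writes f1
FADD_SRC = LD_DEST + LOAD_LATENCY  # fadd reads f4
FADD_DEST = 5                      # fadd writes f5
SD_SRC = FADD_DEST + FADD_LATENCY  # sd reads f9

def row(rrb):
    """Physical registers touched by one kernel iteration at a given RRB."""
    return {
        "ld": rrb + LD_DEST,
        "fadd_dst": rrb + FADD_DEST,
        "fadd_src": rrb + FADD_SRC,
        "sd": rrb + SD_SRC,
    }

for rrb in (8, 7, 6):
    loaded = row(rrb)["ld"]
    consumed = row(rrb - LOAD_LATENCY)["fadd_src"]
    print(f"ld at RRB={rrb} writes P{loaded}; "
          f"fadd at RRB={rrb - LOAD_LATENCY} reads P{consumed}")
```

For example, the value loaded into `P9` at RRB=8 is the one read by `fadd P10, P9` at RRB=5, and the result written to `P13` at RRB=8 is stored by `sd P13` at RRB=4.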


From Computer Desktop Encyclopedia











